Joint Event Detection and Description in Continuous Video Streams

Authors

Huijuan Xu

Boyang Li

Vasili Ramanishka

Leonid Sigal

Kate Saenko

Abstract

As a fine-grained video understanding task, dense video captioning involves first localizing events in a video and then generating captions for the identified events. We present the Joint Event Detection and Description Network (JEDDi-Net) that solves the dense captioning task in an end-to-end fashion. Our model continuously encodes the input video stream with three-dimensional convolutional layers and proposes variable-length temporal events based on pooled features. In order to explicitly model temporal relationships between visual events and their captions in a single video, we propose a two-level hierarchical LSTM module that transcribes the event proposals into captions. Unlike existing dense video captioning approaches, our proposal generation and language captioning networks are trained end-to-end, allowing for improved temporal segmentation. On the large-scale ActivityNet Captions dataset, JEDDi-Net demonstrates improved results as measured by most language generation metrics. We also present the first dense captioning results on the TACoS-MultiLevel dataset.

Similar resources

Action Change Detection in Video Based on HOG

Background and Objectives: Action recognition, the process of labeling an unknown action in a query video, is a challenging problem due to event complexity, variations in imaging conditions, and intra- and inter-individual action variability. A number of solutions have been proposed to solve the action recognition problem. Many of these frameworks suppose that each video sequence includes only one ...

Continuous Tracking Within and Across Camera Streams

This paper presents a new approach for continuous tracking of moving objects observed by multiple, heterogeneous cameras. Our approach simultaneously processes video streams from stationary and Pan-Tilt-Zoom cameras. The detection of moving objects from moving camera streams is performed by defining an adaptive background model that takes into account the camera motion approximated by an affine...

State Space Approaches for Modeling Activities in Video Streams

Title of dissertation: State Space Approaches for Modeling Activities in Video Streams. Naresh P. Cuntoor, Doctor of Philosophy, 2006. Dissertation directed by Professor Rama Chellappa, Department of Electrical and Computer Engineering. The objective is to discern events and behavior in activities using video sequences, which conform to common human experience. It has several applications such as r...

Joint processing of audio and visual information for multimedia indexing and human-computer interaction

Information fusion in the context of combining multiple streams of data (e.g., audio streams and video streams corresponding to the same perceptual process) is considered in a somewhat generalized setting. Specifically, we consider the problem of combining visual cues with audio signals for the purpose of improved automatic machine recognition of descriptors e.g., speech recognition/transcription,...

Recognizing Complex Events Using Large Margin Joint Low-Level Event Model

In this paper we address the challenging problem of complex event recognition by using low-level events. In this problem, each complex event is captured by a long video in which several low-level events happen. The dataset contains several videos, and due to the large number of videos and the complexity of the events, the available annotation for the low-level events is very noisy, which makes the de...

Journal: CoRR

Volume: abs/1802.10250

Publication date: 2018
